Data Visualization Techniques

Venustiano Soancatl Aguilar

Content

  • The grammar of graphics
  • The major components of layers
  • Hands on practice
  • Visualizations based on the gg approach

The grammar of graphics


The grammar of graphics is about grammatical rules for creating perceivable graphs, or what we call graphics (Leland Wilkinson, 2005).


By analogy: good grammar is just the first step in creating a good sentence, and a grammatically valid graphic is not automatically a good one.

An Object-Oriented Graphics System

  1. Specification
    1. DATA : a set of data operations that create variables from datasets,
    2. TRANS : variable transformations (e.g., rank),
    3. SCALE : scale transformations (e.g., log),
    4. COORD : a coordinate system (e.g., polar),
    5. ELEMENT : graphs (e.g., points) and their aesthetic attributes (e.g., color),
    6. GUIDE : one or more guides (axes, legends, etc.).
  2. Assembly
  3. Display

Graphics Pipeline

  • Algebra comprises the operations that allow us to combine variables and specify dimensions of graphs.
  • Scales involve the representation of variables on measured dimensions.
  • Statistics covers the functions that allow graphs to change their appearance and representation schemes.
  • Geometry covers the creation of geometric graphs from variables.

A layered grammar of graphics

Layers of the grammar of graphics


A layer is composed of

  1. data and aesthetic mappings
  2. a geometric object
  3. a statistical transformation
  4. a position adjustment
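
One way to picture the four components is as the fields of a small record. The sketch below uses plain Python; the `Layer` class and its field names are hypothetical and are not part of ggplot2 or plotnine. The two rows are taken from the blue jay dataset used later in these slides.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Hypothetical container mirroring the four parts of a layer."""
    data: list                    # rows, stored as dictionaries
    mapping: dict                 # aesthetic name -> variable name
    geom: str = "point"           # geometric object
    stat: str = "identity"        # statistical transformation
    position: str = "identity"    # position adjustment

    def resolve(self):
        """Apply the aesthetic mapping: one dict of aesthetic values per row."""
        return [
            {aes: row[var] for aes, var in self.mapping.items()}
            for row in self.data
        ]


rows = [
    {"Mass": 73.30, "Head": 56.58, "KnownSex": "M"},
    {"Mass": 65.50, "Head": 53.77, "KnownSex": "F"},
]
layer = Layer(data=rows, mapping={"x": "Mass", "y": "Head", "color": "KnownSex"})
resolved = layer.resolve()
print(resolved[0])
```

A full plot would hold a list of such layers plus scales, a coordinate system, and facets; those parts are left out here to keep the sketch short.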

1. Data and aesthetic mapping

Scales — Mapping Data to Aesthetic Attributes

What is a Scale?

A scale defines how data values are translated into visual properties (aesthetics).

Each aesthetic — color, size, shape, position, etc. — has a corresponding scale.

You can customize scales to control:

  • Color palettes
  • Axis limits and breaks
  • Legend appearance
  • Transformations (e.g., log, sqrt)
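
To make the idea of a scale concrete, here is a minimal sketch in plain Python. The helper names `continuous_scale` and `discrete_scale` are hypothetical; real libraries implement scales with many more options (breaks, labels, out-of-range handling). The two colors are the orange and sky blue of the Okabe-Ito palette.

```python
import math


def continuous_scale(domain, out_range, transform=None):
    """Build a function mapping data values in `domain` linearly onto `out_range`.

    An optional `transform` (e.g. math.log10) is applied before interpolation.
    """
    transform = transform or (lambda v: v)
    d0, d1 = transform(domain[0]), transform(domain[1])
    r0, r1 = out_range

    def scale(value):
        t = (transform(value) - d0) / (d1 - d0)
        return r0 + t * (r1 - r0)

    return scale


def discrete_scale(levels, palette):
    """Build a function mapping each category to one palette entry."""
    lookup = dict(zip(levels, palette))
    return lambda level: lookup[level]


# Temperature (20 to 100) mapped to a point size between 1 and 5
size_scale = continuous_scale(domain=(20, 100), out_range=(1, 5))
# The same idea with a log10 transformation
log_scale = continuous_scale(domain=(1, 100), out_range=(0, 1), transform=math.log10)
# A categorical variable mapped to colors
color_scale = discrete_scale(["F", "M"], ["#E69F00", "#56B4E9"])

print(size_scale(60), log_scale(10), color_scale("M"))
```

Changing the palette or the output range changes only how values look, not the data, which is why scales can be customized independently of the mapping.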

Mapping data to aesthetics

  • Left, continuous data to size and color
  • Right, discrete data to shape and color

Aesthetic mappings

2. Geometric objects

Geometric object

A sample of geometric objects

Graphical primitives

  • geom_path()
  • geom_rect()
  • geom_polygon()

One variable

  • Discrete
    • geom_bar()
  • Continuous
    • geom_histogram()
    • geom_density()

Two variables

  • Both continuous
    • geom_smooth()
    • geom_point()
  • At least one discrete
    • geom_count()
    • geom_jitter()
  • One continuous, one discrete
    • geom_boxplot()
    • geom_violin()

Three variables

  • geom_contour()
  • geom_tile()
  • geom_raster()

Aesthetics mapping in practice

library(dviz.supp)
library(forcats)
library(lubridate)

if (!requireNamespace("gt")) install.packages("gt")
library(gt)

Daily temperature data

station_id   month  day  temperature  flag  date     location
USC00042319  01     1    51.0         S     0-01-01  Death Valley
USC00042319  01     2    51.2         S     0-01-02  Death Valley
USC00042319  01     3    51.3         S     0-01-03  Death Valley
USC00042319  01     4    51.4         S     0-01-04  Death Valley
USC00042319  01     5    51.6         S     0-01-05  Death Valley
USC00042319  01     6    51.7         S     0-01-06  Death Valley

Mapping and geometry

p <- ggplot(temps_long, 
            aes(x = date, 
                y = temperature, 
                color = location)
            ) +
  geom_line(linewidth = 1) +
  scale_x_date(name = "month", 
               limits = c(ymd("0000-01-01"), ymd("0001-01-04")),
               breaks = c(ymd("0000-01-01"), ymd("0000-04-01"), ymd("0000-07-01"),
                          ymd("0000-10-01"), ymd("0001-01-01")),
               labels = c("Jan", "Apr", "Jul", "Oct", "Jan"), expand = c(1/366, 0)) + 
  scale_y_continuous(limits = c(19.9, 107),
                     breaks = seq(20, 100, by = 20),
                     name = "temperature (°F)") +
  scale_color_OkabeIto(order = c(1:3, 7), name = NULL) +
  theme_dviz_grid() +
  theme(legend.title.align = 0.5)

Temperature plot

Seaborn and the Grammar of Graphics

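import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Assumes `lf` (long-format temperature data) and `palette_map`
# (location -> color) were created in an earlier step

# Create plot
fig, ax = plt.subplots(figsize=(9, 5))

# Use seaborn lineplot; pass palette by mapping
sns.lineplot(
    data=lf,
    x='date',
    y='temperature',
    hue='location',
    palette=palette_map,
    linewidth=1.5,  # similar to geom_line linewidth
    ax=ax
)

# X-axis limits (use valid years 2000-01-01 to 2001-01-04)
xmin = pd.to_datetime("2000-01-01")
xmax = pd.to_datetime("2001-01-04")
ax.set_xlim(xmin, xmax)

# Y-axis limits, matching the ggplot2 version
ax.set_ylim(19.9, 107)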

Temperature plot using Seaborn


Changing the geometry to heatmap

Preprocessing:

  • Compute mean by location & month
  • Replace month numbers with names

Mean temperature per month

location      month  mean
Death Valley  Jan    53.45161
Death Valley  Feb    59.94483
Death Valley  Mar    68.44839
Death Valley  Apr    76.29333
Death Valley  May    86.60645
Death Valley  Jun    95.54667

Aesthetics mapping and geometry

p <- ggplot(mean_temps, 
            aes(x = month, y = location, fill = mean)) + 
     geom_tile(width = .95, height = 0.95) +
     scale_fill_viridis_c(option = "B", begin = 0.15, end = 0.98,
                       name = "temperature (°F)") + 
     scale_y_discrete(name = NULL) +
     ...

3. Statistical transformations

Common statistical transformations


ggplot2 stat_ functions
Table adapted from Hadley Wickham (2016).

Name      Description
bin       Divide continuous range into bins, and count number of points in each
boxplot   Compute statistics necessary for boxplot
contour   Calculate contour lines
density   Compute 1d density estimate
identity  Identity transformation, f(x) = x
jitter    Jitter values by adding small random value
qq        Calculate values for quantile-quantile plot
quantile  Quantile regression
smooth    Smoothed conditional mean of y given x
summary   Aggregate values of y for given x
unique    Remove duplicated observations
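
A statistical transformation takes the raw values and returns the summary that the geom actually draws. As an illustration, the sketch below computes the values a boxplot needs, using only the Python standard library. The function name `stat_boxplot` and the input values are illustrative assumptions; the output keys follow the variable names that ggplot2's boxplot stat reports (ymin, lower, middle, upper, ymax), with whiskers at 1.5 times the interquartile range.

```python
import statistics


def stat_boxplot(values):
    """Compute the summary a boxplot geom would draw for one group."""
    # Quartiles with linear interpolation between data points
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme values inside the fences
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "ymin": min(inside),
        "lower": q1,
        "middle": median,
        "upper": q3,
        "ymax": max(inside),
        "outliers": outliers,
    }


# Illustrative values with one extreme observation
summary = stat_boxplot([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(summary)
```

The geom never sees the nine raw values, only these six summary fields, which is the sense in which the stat sits between the data and the geometry.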

Contours

Plot the relationship between body mass and head length in blue jays.


Blue jay dataset

BirdID      KnownSex  BillDepth  BillWidth  BillLength  Head   Mass   Skull  Sex
0000-00000  M         8.26       9.21       25.92       56.58  73.30  30.66  1
1142-05901  M         8.54       8.76       24.99       56.36  75.10  31.38  1
1142-05905  M         8.39       8.78       26.07       57.32  70.25  31.25  1
1142-05907  F         7.78       9.30       23.48       53.77  65.50  30.29  0
1142-05909  M         8.71       9.84       25.47       57.32  74.90  31.85  1
1142-05911  F         7.28       9.30       22.25       52.25  63.90  30.00  0

Contour plot, first version

blue_jays_base <- ggplot(blue_jays, aes(Mass, Head)) + 
  scale_x_continuous(limits = c(57, 82), expand = c(0, 0), name = "body mass (g)") +
  scale_y_continuous(limits = c(49, 61), expand = c(0, 0), name = "head length (mm)" ) +
  theme_dviz_grid()

blue_jays_base + 
  stat_density_2d(color = "black", linewidth = 0.4, binwidth = 0.004) +
  geom_point(color = "black", size = 1.5, alpha = 1/3)

Apply some shading

blue_jays_base + 
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", color = "black", linewidth = 0.15, binwidth = 0.004) +
  geom_point(color = "black", size = 1.5, alpha = .4) +
  scale_fill_gradient(low = "grey95", high = "grey70", guide = "none")

Grouping by sex

blue_jays_base + 
  aes(color = KnownSex) +
  stat_density_2d(linewidth = 0.4, binwidth = 0.006) +
  geom_point(size = 1.5, alpha = 0.7) +
  ...

Bins

Common applications:

  • Histograms
  • Contours
  • Heatmaps: aggregate values into grid cells to display intensity across two dimensions
  • Temporal aggregation
  • Large-data intensity approximation
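
The heatmap case can be sketched in a few lines of plain Python: each record is assigned to a grid cell, and the values in each cell are summed. The function names and the sample records are illustrative assumptions, and the sketch assumes every value lies inside the given ranges.

```python
from collections import defaultdict


def bin_index(value, low, high, n_bins):
    """Index of the equal-width bin containing `value`.

    A value equal to `high` is placed in the last bin.
    """
    width = (high - low) / n_bins
    return min(int((value - low) / width), n_bins - 1)


def bin_2d(records, x_range, y_range, n_x, n_y):
    """Sum z over each (x bin, y bin) cell for records of the form (x, y, z)."""
    grid = defaultdict(float)
    for x, y, z in records:
        cell = (bin_index(x, *x_range, n_x), bin_index(y, *y_range, n_y))
        grid[cell] += z
    return dict(grid)


# Illustrative (x, y, z) records
records = [(0.5, 1.0, 9), (0.7, 1.5, 9), (3.2, 8.0, 86), (3.9, 9.5, 138)]
grid = bin_2d(records, x_range=(0, 4), y_range=(0, 10), n_x=2, n_y=2)
print(grid)
```

However many records go in, the result has at most n_x * n_y cells, which is what makes binning useful for very large tables.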

Binning in mass spectrometry data

Prompt: Given a pandas DataFrame with more than 200 million rows and an ‘mz’ column with more than 26 million unique values, how can the table be aggregated so that we can create a heat map with mz on the vertical axis, time on the horizontal axis, and intensity on the ‘z’ axis (color)?

Id    Time        scanid  index   intensity  mz
1     0.312346    35      376857  9          1548.487069
1     0.312346    38      2796    9          99.330436
1     0.312346    38      274380  9          979.736426
1     0.312346    38      313091  9          1179.351225
1     0.312346    39      322655  9          1231.517479
6481  691.492002  916     55969   86         200.037732
6481  691.492002  916     74584   138        243.540075
6481  691.492002  916     92301   62         288.915998
6481  691.492002  916     218889  53         725.852351
6481  691.492002  917     215210  99         710.363398

AI Response

To aggregate such a large Pandas DataFrame for creating a heatmap, you can use binning and grouping techniques to reduce the data into manageable chunks. Here’s a general approach to achieve this:


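import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Downsample or aggregate the data if necessary to avoid memory issues
# The bin edges are not shown in the response; 500 equal-width bins per axis is an assumption
mz_bins = np.linspace(super_df['mz'].min(), super_df['mz'].max(), 501)
time_bins = np.linspace(super_df['Time'].min(), super_df['Time'].max(), 501)

# Step 2: Create a pivot table (heatmap structure) by aggregating intensity
heatmap_data = super_df.pivot_table(index=pd.cut(super_df['mz'], bins=mz_bins, include_lowest=True),
                              columns=pd.cut(super_df['Time'], bins=time_bins, include_lowest=True),
                              values='intensity',
                              aggfunc='mean')  # You can change 'mean' to 'sum' if appropriate

# Step 3: Apply a logarithmic transformation so that low intensities remain visible
# log1p, i.e. log(1 + x), is an assumption; the original transformation is not shown
heatmap_data_log = np.log1p(heatmap_data)

# Step 4: Plot the heatmap
plt.figure(figsize=(17, 10))  # Adjust figure size as needed
sns.heatmap(heatmap_data_log, cmap='magma', norm=None, cbar_kws={'label': 'Log(Intensity)'})
# 'magma' gives more emphasis on high values

# Step 5: Show the plot
plt.show()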

Result

4. Position adjustment

Position Adjustments in ggplot2

Position  Description                                                         Commonly Used With
identity  No adjustment; geoms are placed exactly where data specifies.       geom_point(), geom_bar()
stack     Stacks elements vertically along the y-axis.                        geom_bar(), geom_area()
fill      Like 'stack', but scales bars to show proportions (fills to 100%).  geom_bar(), geom_area()
dodge     Places overlapping objects side-by-side for comparison.             geom_bar(), geom_boxplot()
jitter    Adds small random variation to reduce overplotting.                 geom_point()
nudge     Moves text or labels slightly to improve readability.               geom_text(), geom_label()
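
Each adjustment is a small computation on positions. The following sketch shows three of them in plain Python. The function names echo the ggplot2 positions but are hypothetical helpers, and the parameter conventions (for example, `amount` as the maximum shift in each direction) are assumptions of this sketch rather than the library's definitions.

```python
import random


def position_jitter(values, amount=0.2, seed=0):
    """Shift each value by a random offset in [-amount, amount]."""
    rng = random.Random(seed)  # fixed seed keeps the plot reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


def position_dodge(x, group_index, n_groups, width=0.9):
    """Center of one group's bar when n_groups bars share position x."""
    slot = width / n_groups
    return x - width / 2 + slot * (group_index + 0.5)


def position_stack(heights):
    """Return (ymin, ymax) for bars stacked in the given order."""
    spans, base = [], 0.0
    for h in heights:
        spans.append((base, base + h))
        base += h
    return spans


jittered = position_jitter([1, 1, 1, 2, 2, 2])
left, right = position_dodge(1, 0, 2), position_dodge(1, 1, 2)
stacked = position_stack([2, 3, 5])
print(jittered, left, right, stacked)
```

Dividing every span from `position_stack` by the total height would give the 'fill' adjustment, where each stack reaches 100%.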

Position dodge

Using jitter to deal with occlusion

Partial transparency

Jitter

Using jitter to visualize outliers

The facets layer

  • Faceting = splitting a dataset into subsets and drawing the same plot design for each subset.
  • It’s a declarative way to show conditional relationships: “plot y vs x for each level of variable z”.
  • Benefits: compares patterns across groups while keeping scales and geoms consistent.
  • The row and column variables must be categorical.
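
Conceptually, faceting is a group-by followed by the same plot for every group. A minimal sketch in plain Python, using four rows from the blue jay dataset shown earlier; `facet_split` is a hypothetical helper, not a library function.

```python
from collections import defaultdict


def facet_split(rows, by):
    """Split rows into one subset (panel) per level of the variable `by`."""
    panels = defaultdict(list)
    for row in rows:
        panels[row[by]].append(row)
    return dict(panels)


rows = [
    {"KnownSex": "M", "Mass": 73.30, "Head": 56.58},
    {"KnownSex": "M", "Mass": 75.10, "Head": 56.36},
    {"KnownSex": "F", "Mass": 65.50, "Head": 53.77},
    {"KnownSex": "F", "Mass": 63.90, "Head": 52.25},
]
panels = facet_split(rows, by="KnownSex")

# Shared scales: limits come from the full dataset, not from each panel,
# so that positions are comparable across panels
x_limits = (min(r["Mass"] for r in rows), max(r["Mass"] for r in rows))

for level, subset in panels.items():
    print(level, len(subset), x_limits)
```

Computing the limits once, outside the loop, is what keeps the scales consistent; computing them per panel would correspond to free scales.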

Violin plots example

Heatmap facets example

Plotly Express — interactive faceting

import plotly.express as px

df = px.data.tips()
fig = px.scatter(df, x="total_bill", y="tip", facet_col="sex", color="smoker",
                 title="Tip vs Bill faceted by sex")
fig.update_layout(legend_title_text="Smoker")
fig.show()

Coordinate systems

  • In the Grammar of Graphics, the coordinate system defines how data coordinates are mapped to the 2D plane of the plot.
  • It determines:
    • The axes (orientation and scaling)
    • The shape of geometric objects
    • The relationships between x and y aesthetics

Importance of coordinate systems

  • They control how data is drawn, not what data is shown.
  • Changing coordinates can:
    • Flip, stretch, or transform plots
    • Reveal patterns not visible in the default Cartesian system
    • Support specialized visualizations (like polar plots)

Common Coordinate Systems

Common Coordinate Systems in ggplot2
Function           Description
coord_cartesian()  Default Cartesian coordinates; standard x-y axes.
coord_flip()       Swaps x and y axes; useful for horizontal bar plots.
coord_fixed()      Ensures fixed aspect ratio between x and y units.
coord_polar()      Converts Cartesian to polar coordinates (e.g., pie charts).
coord_quickmap()   Approximates a Mercator projection; great for maps.
coord_trans()      Applies a mathematical transformation to axes (e.g., log scale).

Example: Polar Coordinates

  • Synthetic daily temperature data
  • Time series -> circular layout
  • x-axis (day of year) -> angle
  • y-axis (temperature) -> distance from center
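
The mapping in the list above amounts to a change of coordinates. A minimal sketch in plain Python, assuming the angle starts at 12 o'clock and runs clockwise (the default orientation of coord_polar() in ggplot2); the function name `polar_to_xy` and the synthetic temperatures are illustrative assumptions.

```python
import math


def polar_to_xy(day, temperature, days_in_year=365):
    """Map (day of year, temperature) to Cartesian drawing coordinates.

    Day of year becomes the angle, temperature becomes the distance
    from the center.
    """
    theta = 2 * math.pi * (day - 1) / days_in_year
    # Clockwise from 12 o'clock: x uses sin, y uses cos
    return temperature * math.sin(theta), temperature * math.cos(theta)


# Synthetic daily temperatures (illustrative, not real measurements):
# a yearly sine wave around 60 with an amplitude of 20
days = range(1, 366)
temps = [60 + 20 * math.sin(2 * math.pi * (d - 105) / 365) for d in days]
points = [polar_to_xy(d, t) for d, t in zip(days, temps)]

print(points[0])
```

The data and the mapping are unchanged; only the final step from (x, y) to a position on the page differs, which is the sense in which coordinates control how data is drawn rather than what is shown.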

RUG plot app

Create facets, violin plots, tikz using:

The Grammar of Graphics in Python

plotnine